CHAPTER 2. SUFFICIENCY AND ESTIMATION

2.1 Sufficient statistics. Classically, a "statistic" is a measurable function of the
observations, say $f(X_1, \dots, X_n)$. The concept of "statistic" differs from that of "random
variable" in that a random variable is defined on a probability space, so that one
probability measure is singled out, whereas for statistics we have in mind a family of
possible probability measures.

R. A. Fisher (1922) called a statistic sufficient "when no other statistic which can be 
calculated from the same sample provides any additional information as to the value of 
the parameter to be estimated." For example, given a sample $(X_1, \dots, X_n)$ where the $X_j$
are i.i.d. $N(\theta, 1)$, the sample mean $\bar{X} := (X_1 + \cdots + X_n)/n$ turns out to be a sufficient
statistic for the unknown parameter $\theta$. J. Neyman (1935) gave one form of a "factorization
theorem" for sufficient statistics. This section will give more general forms of the definitions 
and theorem due to Halmos and L. J. Savage (1949). 

In general, given a measurable space $(S, \mathcal{B})$, that is, a set $S$ with a $\sigma$-algebra $\mathcal{B}$ of
subsets, and another measurable space $(Y, \mathcal{F})$, a statistic is a measurable function $T$ from
$S$ into $Y$. Often, $Y = \mathbb{R}$ or a Euclidean space $\mathbb{R}^d$ with Borel $\sigma$-algebra.

Let $T^{-1}(\mathcal{F}) := \{T^{-1}(F) : F \in \mathcal{F}\}$. Then $T^{-1}(\mathcal{F})$ is a $\sigma$-algebra and is the smallest
$\sigma$-algebra on $S$ for which $T$ is measurable.

For any measure $\mu$ on $\mathcal{B}$, $\mathcal{L}^1(\mu) := \mathcal{L}^1(S, \mathcal{B}, \mu)$ denotes the set of all real-valued,
$\mathcal{B}$-measurable, $\mu$-integrable functions on $S$. For any probability measure $P$ on $(S, \mathcal{B})$,
sub-$\sigma$-algebra $\mathcal{A} \subset \mathcal{B}$, and $f \in \mathcal{L}^1(P)$, we have a conditional expectation $E_P(f \mid \mathcal{A})$,
defined as a function $g \in \mathcal{L}^1(S, \mathcal{A}, P)$, so that $g$ is measurable for $\mathcal{A}$, such that for all
$A \in \mathcal{A}$, $\int_A g \, dP = \int_A f \, dP$. Such a $g$ always exists, as a Radon-Nikodym derivative of the
signed measure $A \mapsto \int_A f \, dP$ with respect to the restriction of $P$ to $\mathcal{A}$ (RAP, Theorems 5.5.4,
10.1.1). If $f = 1_B$ for some $B \in \mathcal{B}$ let $P(B \mid \mathcal{A}) := E_P(1_B \mid \mathcal{A})$. Then almost surely
$0 \le P(B \mid \mathcal{A}) \le 1$. Conditional expectations for $P$ are only defined up to equality $P$-almost
surely. Now, we will need the fact that suitable $\mathcal{A}$-measurable functions can be brought
outside conditional expectations (RAP, Theorem 10.1.9), much as constants can be brought
outside ordinary expectations:

2.1.1 Lemma. $E_P(fg \mid \mathcal{A}) = f E_P(g \mid \mathcal{A})$ whenever both $g$ and $fg$ are in $\mathcal{L}^1(S, \mathcal{B}, P)$ and
$f$ is $\mathcal{A}$-measurable.
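On a finite sample space the lemma can be checked by direct computation. The following is a minimal sketch, assuming a four-point space, a made-up law `P`, and a partition `atoms` generating the sub-$\sigma$-algebra (all of these are illustrative choices, not from the text). In that setting a conditional expectation is simply the $P$-weighted average over each atom.

```python
from fractions import Fraction

def cond_exp(h, P, atoms):
    """E_P(h | A) on a finite space, where A is generated by the partition `atoms`.
    On each atom (assumed to have positive probability) it is the P-weighted mean of h."""
    out = {}
    for atom in atoms:
        p_atom = sum(P[s] for s in atom)
        avg = sum(h[s] * P[s] for s in atom) / p_atom
        for s in atom:
            out[s] = avg
    return out

# Illustrative law and partition (assumptions for this sketch).
P = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
atoms = [{0, 1}, {2, 3}]

f = {0: 5, 1: 5, 2: -2, 3: -2}   # constant on atoms, hence measurable for the partition
g = {0: 1, 1: 4, 2: 9, 3: 16}    # arbitrary function
fg = {s: f[s] * g[s] for s in P}

lhs = cond_exp(fg, P, atoms)                 # E_P(fg | A)
Eg = cond_exp(g, P, atoms)
rhs = {s: f[s] * Eg[s] for s in P}           # f * E_P(g | A)
print(lhs == rhs)
```

If `f` were not constant on the atoms, the two sides would in general differ, which is why measurability of $f$ for the sub-$\sigma$-algebra is needed.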

Now, here are precise definitions of sufficiency: 

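Definition. Given a family $\mathcal{P}$ of probability measures on $\mathcal{B}$, a sub-$\sigma$-algebra $\mathcal{A} \subset \mathcal{B}$ is
called sufficient for $\mathcal{P}$ if and only if for every $B \in \mathcal{B}$ there is an $\mathcal{A}$-measurable function
$f_B \ge 0$ such that $f_B = P(B \mid \mathcal{A})$ $P$-almost surely for every $P \in \mathcal{P}$. A statistic $T$ from $(S, \mathcal{B})$
to $(Y, \mathcal{F})$ is called sufficient if and only if $T^{-1}(\mathcal{F})$ is sufficient.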

The $\sigma$-algebra $\mathcal{A}$ is pairwise sufficient for $\mathcal{P}$ iff for every $P, Q \in \mathcal{P}$ the likelihood ratio
$R_{Q/P}$ is $\mathcal{A}$-measurable; more precisely, $R_{Q/P}$ can be chosen to be $\mathcal{A}$-measurable, and any
choice of $R_{Q/P}$ is equal $(P + Q)$-almost everywhere to such an $\mathcal{A}$-measurable function.

If, for example, $\mathcal{A}$ is finite, generated by a finite partition, and is sufficient for $\mathcal{P}$, then
on each atom $A$ of $\mathcal{A}$, $f_B$, being $\mathcal{A}$-measurable, must be constant, and

$$Q_A(B) := \int_A f_B \, dP / P(A) = P(A \cap B)/P(A) = P(B \mid A)$$

is the same for all $P \in \mathcal{P}$ with $P(A) > 0$. Or if $P(A) = 0$ for all $P \in \mathcal{P}$ but $A$ is non-empty,
choose any $y \in A$ and let $Q_A := \delta_y$ (point mass at $y$). In either case, $Q_A$ is a probability
measure concentrated on $A$ such that for all $B \in \mathcal{B}$ and $P \in \mathcal{P}$, $P(B \cap A) = Q_A(B) P(A)$.
This is a simple case of "factorization" into a product where one factor depends on $B$ but
not $P$ and the other on $P$ but not $B$. Thus, for estimating $P$, once we know that $x$ is in the
atom $A$ of $\mathcal{A}$, the exact value of $x$ can give no further information about $P$ in $\mathcal{P}$, since each
$P$ in $\mathcal{P}$ conditioned on $A$ reduces to the one law $Q_A$. Also, for any $\mu$ and $\nu$ in $\mathcal{P}$, the likelihood
ratio $R_{\nu/\mu}$ is constant on the atom $A$.
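The finite-partition factorization can be verified exactly in a small case. The sketch below assumes three i.i.d. Bernoulli($p$) observations, atoms given by the value of the sum, and a particular event `B`; these choices, and the helper names `law` and `conditional_on_atom`, are illustrative and not from the text.

```python
from fractions import Fraction
from itertools import product

def law(p, n=3):
    """Probability of each 0/1 sequence of length n under i.i.d. Bernoulli(p)."""
    return {x: p ** sum(x) * (1 - p) ** (n - sum(x))
            for x in product((0, 1), repeat=n)}

def conditional_on_atom(P, A, B):
    """Q_A(B) = P(A & B) / P(A), for an atom A with P(A) > 0."""
    p_atom = sum(P[x] for x in A)
    return sum(P[x] for x in A & B) / p_atom

n = 3
points = set(product((0, 1), repeat=n))
# Atoms of the finite sigma-algebra: sequences sharing the same sum.
atoms = {k: {x for x in points if sum(x) == k} for k in range(n + 1)}
B = {x for x in points if x[0] == 1}   # event "first observation equals 1"

params = [Fraction(1, 5), Fraction(1, 2), Fraction(7, 10)]
Q = {k: {p: conditional_on_atom(law(p), atoms[k], B) for p in params}
     for k in atoms}
for k in atoms:
    print(k, set(Q[k].values()))   # one value per atom, whatever p is
```

For each atom the set of values collapses to a single number, so the conditional law on the atom does not involve $p$, and $P(B \cap A) = Q_A(B) P(A)$ holds with the first factor free of $P$.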

Some facts on measurable functions will be needed. For a $\sigma$-algebra $\mathcal{S}$ and set $A$ (not
necessarily in $\mathcal{S}$) let $\mathcal{S}_A$ be the "relative" $\sigma$-algebra $\{C \cap A : C \in \mathcal{S}\}$. For any functions $f$, $g$
such that $f$ is defined on the range of $g$ let $f \circ g$ denote the composition, $(f \circ g)(x) := f(g(x))$
for all $x$ in the domain of $g$. Recall (RAP, Theorems 4.2.5, 4.2.8):

2.1.2 Theorem. Let $(Y, \mathcal{S})$ be any measurable space and $A$ an arbitrary subset of $Y$. Let
$f$ be any real-valued function on $A$ measurable for $\mathcal{S}_A$. Then there is an $\mathcal{S}$-measurable
extension of $f$ to all of $Y$.

2.1.3 Theorem. Given a set $S$, a measurable space $(Y, \mathcal{F})$, and a function $T$ from $S$ into
$Y$, a real-valued function $f$ on $S$ is $T^{-1}(\mathcal{F})$-measurable if and only if $f = g \circ T$ for some
$\mathcal{F}$-measurable function $g$.

Theorem 2.1.2 is immediate if $A$ is in $\mathcal{S}$, since $f$ can be given any constant value on
the complement of $A$. But sets $A$ not in $\mathcal{S}$ can arise: for example, the range of a Borel
measurable function $T : \mathbb{R} \to \mathbb{R}$ need not be Borel (RAP, Theorem 13.2.5).
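Theorem 2.1.3 has an elementary finite analogue: when $S$ is finite and every subset of $Y$ is measurable, a function $f$ is measurable for the $\sigma$-algebra generated by $T$ exactly when it is constant on the level sets of $T$. The sketch below, with a hypothetical helper `factor_through` and made-up choices of `S`, `T`, and `f`, tries to build the function $g$ with $f = g \circ T$ and reports failure when none exists.

```python
def factor_through(f, T, S):
    """Return a dict g with f(x) == g[T(x)] for all x in S, or None if no such g exists."""
    g = {}
    for x in S:
        t = T(x)
        if t in g and g[t] != f(x):
            return None          # f takes two different values on one level set of T
        g[t] = f(x)
    return g

S = range(6)
T = lambda x: x % 3                                   # the "statistic"

g1 = factor_through(lambda x: (x % 3) ** 2, T, S)     # depends on x only through T(x)
g2 = factor_through(lambda x: x, T, S)                # separates points with equal T(x)
print(g1, g2)
```

In the general measurable setting the difficulty is not the existence of some $g$ but its measurability, which is where Theorem 2.1.2 enters.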

Recall, as in Section 1.3, that if a family $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ of laws is dominated by
a $\sigma$-finite measure $\mu$, meaning that every $P \in \mathcal{P}$ is absolutely continuous with respect to
$\mu$, then there exist Radon-Nikodym derivatives $dP/d\mu$, which give us likelihood functions
$f(\theta, x) := (dP_\theta/d\mu)(x)$, $\theta \in \Theta$, $x \in S$, and allow us to write in integrals
$dP_\theta(x) = f(\theta, x) \, d\mu(x)$.

Neyman (1935), Halmos and Savage (1949), and Bahadur (1954) proved the equivalence
of (a) and (b) for dominated families in the following theorem.

2.1.4 Theorem (Factorization theorem). Let $(S, \mathcal{B})$ be a measurable space, $\mathcal{A}$ a
sub-$\sigma$-algebra of $\mathcal{B}$, and $\mathcal{P}$ a non-empty family of probability measures on $\mathcal{B}$.

(I) If $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\mu$, then the following are equivalent:

(a) $\mathcal{A}$ is sufficient for $\mathcal{P}$;

(b) there is a $\mathcal{B}$-measurable function $h \ge 0$ such that for all $P \in \mathcal{P}$, there is an
$\mathcal{A}$-measurable function $f_P$ with $dP/d\mu = f_P h$. We can take $h \in \mathcal{L}^1(S, \mathcal{B}, \mu)$.

(c) $\mathcal{A}$ is pairwise sufficient for $\mathcal{P}$.

(II) In general, if $\mathcal{P}$ is not necessarily dominated, (a) always implies (c).

Theorem 2.1.4 has the following consequence, via Theorem 2.1.3: 
2.1.5 Corollary. Let $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ be a non-empty family of laws on the sample space
$(S, \mathcal{B})$, dominated by a $\sigma$-finite measure $\mu$, and let $T$ be a statistic on $(S, \mathcal{B})$. Then the
following are equivalent:

(a') $T$ is sufficient for $\mathcal{P}$;

(b') The likelihood functions $f(\theta, x)$ can be written as $G(\theta, T(x)) h(x)$, where for each
$\theta$, $G(\theta, \cdot)$ is a measurable function from $Y$ to $\mathbb{R}$, and $h$ is a measurable function on $(S, \mathcal{B})$;

(c') For any $P, Q \in \mathcal{P}$, the likelihood ratio $R_{Q/P}$ can be written as a measurable
function of $T(x)$.

Thus, when the parameter space contains just two points, giving laws $P$ and $Q$, the
likelihood ratio $R_{Q/P}$ is a sufficient statistic, since $R_{P/Q} = 1/R_{Q/P}$. Before beginning
the proofs, here is an example. Consider the family of gamma distributions with scale
parameter equal to 1, namely, let $\mu$ be Lebesgue measure on $(0, \infty)$ and let
$f(\theta, x) := x^{\theta - 1} e^{-x}/\Gamma(\theta)$ for $x > 0$ and $0$ for $x \le 0$, where $0 < \theta < \infty$. Then the likelihood
function for $n$ i.i.d. observations $X_1, \dots, X_n$ is

$$\prod_{j=1}^n f(\theta, X_j) = (X_1 X_2 \cdots X_n)^{\theta - 1} \exp\Big(-\sum_{j=1}^n X_j\Big) \Big/ \Gamma(\theta)^n.$$

It follows from Corollary 2.1.5 that $X_1 X_2 \cdots X_n$ is a sufficient statistic for the family. The
function $h(x) := \exp(-(X_1 + \cdots + X_n))$ in this case doesn't depend on $\theta$, thus it divides out
in forming likelihood ratios.
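A quick numerical check of the gamma example: two samples with the same product but different sums should give the same likelihood ratio between any two parameter values, even though their likelihoods differ. The sketch below works on the log scale with the standard-library function `math.lgamma`; the sample values and the parameter values 2.5 and 0.7 are arbitrary choices made for illustration.

```python
import math

def gamma_log_likelihood(theta, xs):
    """Log of prod_j x_j^(theta-1) * exp(-x_j) / Gamma(theta), for positive xs."""
    n = len(xs)
    return ((theta - 1) * sum(math.log(x) for x in xs)
            - sum(xs)
            - n * math.lgamma(theta))

def log_likelihood_ratio(theta_q, theta_p, xs):
    """Log of the likelihood ratio between parameter values theta_q and theta_p."""
    return gamma_log_likelihood(theta_q, xs) - gamma_log_likelihood(theta_p, xs)

sample_a = [1.0, 4.0, 2.0]   # product 8, sum 7
sample_b = [2.0, 2.0, 2.0]   # product 8, sum 6

ratio_a = log_likelihood_ratio(2.5, 0.7, sample_a)
ratio_b = log_likelihood_ratio(2.5, 0.7, sample_b)
print(math.isclose(ratio_a, ratio_b, rel_tol=1e-9))
```

The term coming from the sum of the observations is the same for both parameter values and cancels in the difference of log-likelihoods, which is the log-scale version of $h$ dividing out.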

A first step in the proof of Theorem 2.1.4, that (b) implies (c) in (I), is easy: for
any $P, Q \in \mathcal{P}$, (b) implies that we can take $R_{Q/P}(x) = (f_Q(x) h(x))/(f_P(x) h(x)) =
f_Q(x)/f_P(x)$, which is $\mathcal{A}$-measurable. We use the usual conventions for likelihood ratios,
$y/0 := +\infty$ if $y > 0$ and $0$ if $y = 0$. From (b), $P(\{x : h(x) = 0\}) = 0$ for all $P \in \mathcal{P}$,
so $h(x)/h(x)$ does not become indeterminate except on negligible sets.

To continue the proof of Theorem 2.1.4 we will need some facts about dominated 
families. Two measures are called equivalent if each is absolutely continuous with respect 
to the other. This notion extends to families of laws as follows. 

Definition. Two families $\mathcal{P}$ and $\mathcal{Q}$ of probability measures on a $\sigma$-algebra $\mathcal{B}$ are called
equivalent if and only if for all $B \in \mathcal{B}$, ($P(B) = 0$ for all $P \in \mathcal{P}$) is equivalent to ($Q(B) = 0$
for all $Q \in \mathcal{Q}$). If $\mathcal{Q}$ consists of just one law $Q$ we will say $\mathcal{P}$ is equivalent to $Q$.

2.1.6 Lemma. Assume that a non-empty family $\mathcal{P}$ of probability measures is dominated
by a $\sigma$-finite measure $\mu$. Then:

(a) $\mathcal{P}$ is dominated by a probability measure $\mu'$.

(b) $\mathcal{P}$ is equivalent to some probability measure $\mu_0$.

(c) There is a countable subfamily $\{Q_k\}_{k \ge 1}$ of $\mathcal{P}$ equivalent to $\mathcal{P}$.

(d) $\mathcal{P}$ is equivalent to a probability measure $\nu$ given by $\nu = \sum_{k=1}^\infty Q_k/2^k$ for some
sequence $\{Q_k\}_{k \ge 1} \subset \mathcal{P}$.

Proof. Part (d) implies the other parts, which are steps in its proof. 

(a): If $\mu(S) < \infty$ let $\mu'(B) := \mu(B)/\mu(S)$ for all $B \in \mathcal{B}$. Otherwise, $S = \bigcup_{i=1}^\infty A_i$
where $0 < \mu(A_i) < \infty$ for each $i$ and $A_i \cap A_j = \emptyset$ for $i \ne j$. For each $B \in \mathcal{B}$ let
$\mu'(B) := \sum_{i=1}^\infty \mu(B \cap A_i)/(2^i \mu(A_i))$. Then in either case $\mu'$ is a probability measure
equivalent to $\mu$, and both dominate $\mathcal{P}$.
(b): Take $\mu = \mu'$ from (a). Let $\gamma := \sup\{\mu(C) : P(C) = 0 \text{ for all } P \in \mathcal{P}\}$.
Take $C_i \in \mathcal{B}$ with $P(C_i) = 0$ for all $P \in \mathcal{P}$ and $\mu(C_i) \uparrow \gamma$. Let $C := \bigcup_i C_i$. Then
$\mu(C) = \gamma$ and $P(C) = 0$ for all $P \in \mathcal{P}$. We have $\mu(C^c) > 0$ since $\mu$ dominates $\mathcal{P}$. Let
$\mu_0(B) := \mu(B \cap C^c)/\mu(C^c)$, a probability measure. Then clearly $\mu_0$ dominates $\mathcal{P}$. On
the other hand if $P(A) = 0$ for all $P \in \mathcal{P}$ then $\mu_0(A) = 0$, otherwise we could replace $C$
by $C \cup A$ and violate the definition of $\gamma$. So $\mu_0$ is equivalent to $\mathcal{P}$.

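(c): Take $\mu = \mu_0$ from (b). For any $\{P_i\}_{i \ge 1} \subset \mathcal{P}$ let $\gamma(\{P_i\}_{i \ge 1}) := \sup\{\mu(C) :
P_i(C) = 0 \text{ for all } i\}$. Let $\alpha := \inf\{\gamma(\{P_i\}_{i \ge 1}) : \{P_i\}_{i \ge 1} \subset \mathcal{P}\}$. For $j = 1, 2, \dots$,
take $\{P_{ji}\}_{i \ge 1}$ such that $\gamma(\{P_{ji}\}_{i \ge 1}) < \alpha + 1/j$. Let $\{Q_k\}_{k \ge 1} = \{P_{ji}\}_{j \ge 1, i \ge 1}$. Then
$\gamma(\{Q_k\}_{k \ge 1}) = \alpha$. By the proof of (b), for some $C \in \mathcal{B}$, $\mu(C) = \alpha$ and $Q_k(C) = 0$ for
all $k$. It will be shown that $\alpha = 0$. If not, $\alpha > 0$. Then by (b), for some $P \in \mathcal{P}$,
$P(C) > 0$. Letting $Q_0 := P$ we have $\gamma(\{Q_k\}_{k \ge 0}) \le \mu(\{x \in C : (dQ_0/d\mu)(x) = 0\}) < \alpha$,
a contradiction. So $\alpha = 0$ and $\mathcal{P}$ is equivalent to $\{Q_k\}_{k \ge 1}$.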

It is straightforward that (c) implies (d), so the Lemma is proved. □ 

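Now to continue the proof of Theorem 2.1.4, we have already proved (b) implies (c)
in (I). To prove (c) implies (a) and (b), take $\nu = \sum_{k=1}^\infty Q_k/2^k$ from Lemma 2.1.6(d).
For any $P \in \mathcal{P}$, we have $R_{\nu/P} = \sum_{k=1}^\infty R_{Q_k/P}/2^k$, which is $\mathcal{A}$-measurable. Thus, so is
$dP/d\nu = R_{P/\nu} = 1/R_{\nu/P}$, with the usual conventions $1/\infty := 0$, $1/0 := +\infty$. Thus we
can write

$$\frac{dP}{d\mu} = \frac{dP}{d\nu} \frac{d\nu}{d\mu},$$

which gives (b) since $dP/d\nu$ is $\mathcal{A}$-measurable, with $h := d\nu/d\mu \in \mathcal{L}^1(S, \mathcal{B}, \mu)$.

Now to prove (a), for each $B \in \mathcal{B}$, let $f_B := \nu(B \mid \mathcal{A}) := E_\nu(1_B \mid \mathcal{A})$. For any $P \in \mathcal{P}$,
since $dP/d\nu$ is $\mathcal{A}$-measurable we have by Lemma 2.1.1 that

$$E_\nu\Big(1_B \frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) = \frac{dP}{d\nu} E_\nu(1_B \mid \mathcal{A}) = f_B \frac{dP}{d\nu}.$$

Using this and the definition of conditional expectation twice, we have for any $A \in \mathcal{A}$,

$$\int_A f_B \, dP = \int_A f_B \frac{dP}{d\nu} \, d\nu = \int_A E_\nu\Big(1_B \frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) d\nu = \int_A 1_B \frac{dP}{d\nu} \, d\nu = \int_A 1_B \, dP.$$

Thus $f_B$ satisfies the definition of $P(B \mid \mathcal{A})$, proving (a).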

(a) implies (b): Again take $\nu$ from Lemma 2.1.6(d). For any $B \in \mathcal{B}$ take $f_B$ from
(a). Since $0 \le f_B \le 1$ $P$-almost surely for all $P$ in $\mathcal{P}$, also $0 \le f_B \le 1$ almost surely for
$\nu$. We have $Q_k(B \cap A) = \int_A f_B \, dQ_k$ for all $k$ and all $A \in \mathcal{A}$, so $\nu(B \cap A) = \int_A f_B \, d\nu$, so
$f_B = E_\nu(1_B \mid \mathcal{A})$.

To show that for each $P \in \mathcal{P}$, $dP/d\nu$ is $\mathcal{A}$-measurable, note that for each $B$ in $\mathcal{B}$, by
the definition of likelihood ratio and since $S \in \mathcal{A}$,

$$\int_B \frac{dP}{d\nu} \, d\nu = P(B) = \int f_B \, dP = \int f_B \frac{dP}{d\nu} \, d\nu = \int E_\nu\Big(f_B \frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) d\nu,$$

which by Lemma 2.1.1 twice equals

$$\int E_\nu(1_B \mid \mathcal{A}) \, E_\nu\Big(\frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) d\nu = \int E_\nu\Big(1_B \, E_\nu\Big(\frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) \,\Big|\, \mathcal{A}\Big) d\nu = \int 1_B \, E_\nu\Big(\frac{dP}{d\nu} \,\Big|\, \mathcal{A}\Big) d\nu.$$

Since $dP/d\nu$ and $E_\nu(dP/d\nu \mid \mathcal{A})$ have the same integral with respect to $\nu$ over all sets in $\mathcal{B}$,
they must be equal $\nu$-a.s., so that $dP/d\nu$ is equal $\nu$-a.s. to an $\mathcal{A}$-measurable function. So (a)
implies (b), again with $h = d\nu/d\mu \in \mathcal{L}^1(\mu)$ since $\nu$ is a probability measure, and part (I) of
Theorem 2.1.4 is proved.

For (II), to show (a) implies (c) in general, if $\mathcal{P}$ is not necessarily dominated, take any
$P, Q \in \mathcal{P}$. Then $\{P, Q\}$ is dominated, e.g. by $(P + Q)/2$, and $\mathcal{A}$ is sufficient for it, so by
(a) implies (c) in the dominated case, we get that (c) holds. Theorem 2.1.4 is proved. □

The next fact gives an example showing that pairwise sufficiency does not imply 
sufficiency in general. 

2.1.7 Proposition. There exists a family of laws $\mathcal{P}$ on a sample space $(S, \mathcal{B})$ and a
sub-$\sigma$-algebra $\mathcal{A} \subset \mathcal{B}$ which is pairwise sufficient for $\mathcal{P}$ but not sufficient.

Proof. Let $S = [0, 1]$ with Borel $\sigma$-algebra $\mathcal{B}$. Let $\mathcal{A}$ be the sub-$\sigma$-algebra of countable
sets and their complements. Let $\mathcal{P}$ be the family of all purely atomic laws on $S$. Thus
$P \in \mathcal{P}$ if and only if $P(M) = 1$ for some countable set $M$. Let $P, Q \in \mathcal{P}$ and take a
countable $A \subset S$ such that $P(A) = Q(A) = 1$. Defining $R_{Q/P}(x) := 0$ for $x \notin A$, it is
clear that $R_{Q/P}$ is $\mathcal{A}$-measurable, so $\mathcal{A}$ is pairwise sufficient for $\mathcal{P}$.

Let $B := [0, 1/2]$. For any $P \in \mathcal{P}$ and the countable set $A := \{x : P(\{x\}) > 0\}$,
we must have $E_P(1_B \mid \mathcal{A})(x) = 1_B(x)$ for $x \in A$. If $f_B$ has the property in the definition
of sufficiency then $f_B(x) = 1_B(x)$ for all $x$, since for all $x$ there is some $P \in \mathcal{P}$ with
$P(\{x\}) > 0$. But $1_B$ is not $\mathcal{A}$-measurable, so $\mathcal{A}$ is not sufficient for $\mathcal{P}$. □

Note that in the example in the last proof, $\mathcal{P}$ is dominated by a measure $c$, namely
counting measure on $[0, 1]$, and derivatives $dP/dc$ exist and are $\mathcal{A}$-measurable, but $c$ is not
$\sigma$-finite.

Sufficiency, defined in terms of conditional probabilities of measurable sets, can be 
extended to suitable conditional expectations: 

2.1.8 Theorem. Let $\mathcal{A}$ be sufficient for a family $\mathcal{P}$ of laws on a measurable space $(S, \mathcal{B})$.
Then for any measurable real-valued function $f$ on $(S, \mathcal{B})$ which is integrable for each
$P \in \mathcal{P}$, there is an $\mathcal{A}$-measurable function $g$ such that $g = E_P(f \mid \mathcal{A})$ a.s. for all $P \in \mathcal{P}$.

Proof. When $f$ is the indicator function of a set in $\mathcal{B}$, the assertion is the definition of
sufficiency. It then follows for any simple function, which is a finite linear combination of
such indicators. If $f$ is nonnegative, there is a sequence of nonnegative simple functions
increasing up to $f$ and the conclusion follows (RAP, Proposition 4.1.5 and Theorem 10.1.7).
Then any $f$ satisfying the hypothesis can be written as $f = f^+ - f^-$ for $f^+$ and $f^-$
nonnegative and the result follows. □

Rather generally, for a sufficient $\sigma$-algebra $\mathcal{A}$, the $\mathcal{A}$-measurable decision rules are an
essentially complete class. The idea is that for any decision rule $d$, since conditional
probabilities $P(\cdot \mid \mathcal{A})(x)$ don't depend on $P \in \mathcal{P}$, given $x$, we choose $y$ at random in $S$ according to
this conditional distribution, then take decision $d(y)$, thus giving an $\mathcal{A}$-measurable
randomized decision rule $e(x)$ which has the same risk function as $d$. For conditional probabilities
allowing such a choice to be made we will need some regularity conditions such as the
following.

A measurable space $(S, \mathcal{B})$ will be called standard if there is a 1-1 function, measurable
with measurable inverse, from $S$ onto a Borel set in a complete separable metric space or,
equivalently, onto a complete separable metric space (RAP, Sec. 13.1). Most sample spaces
considered in statistics are standard, where the Borel set is an open or closed subset of
a Euclidean space $\mathbb{R}^k$. On a class $\mathcal{P}$ of laws on $(S, \mathcal{B})$, recall that as defined just before
(1.2.3), $\mathcal{S}_{\mathcal{B}}$ is the smallest $\sigma$-algebra making the functions $P \mapsto P(B)$ measurable for all
$B \in \mathcal{B}$.

2.1.9 Theorem. Let $\mathcal{P}$ be a class of probability measures on a standard measurable space
$(S, \mathcal{B})$, dominated by a $\sigma$-finite measure, and $\mathcal{A}$ a sub-$\sigma$-algebra sufficient for $\mathcal{P}$. Then
for any action space $(A, \mathcal{E})$, any measurable loss function $L$ on $\mathcal{P} \times A$, and any (possibly
randomized) decision rule $d$, there exists a decision rule $e$, measurable on $(S, \mathcal{A})$, with risk
$r(P, e) = r(P, d)$ for all $P \in \mathcal{P}$. So, the $\mathcal{A}$-measurable decision rules form an essentially
complete class.

Proof. The conditional probability $P(B \mid \mathcal{A})(x)$ for $P \in \mathcal{P}$, $B \in \mathcal{B}$ and $x \in S$ is initially
defined for each fixed $B$ up to $P$-almost sure equality with respect to $x$. It will be called
a regular conditional probability if for each $B \in \mathcal{B}$, $P(B \mid \mathcal{A})(\cdot)$ is a specific $\mathcal{A}$-measurable
function such that for each $x$, $P(\cdot \mid \mathcal{A})(x)$ is a countably additive probability measure on
$\mathcal{B}$. Regular conditional probabilities exist in this case (RAP, Theorem 10.2.2, with $Y$ =
the identity and $\mathcal{A}$, $\mathcal{C}$ there = $\mathcal{B}$, $\mathcal{A}$ here respectively). Take $\nu$ from Lemma 2.1.6(d) and
a regular conditional probability $\nu(\cdot \mid \mathcal{A})(\cdot)$. By sufficiency of $\mathcal{A}$ and the proof of Theorem
2.1.4, (c) $\Rightarrow$ (a), for any $P \in \mathcal{P}$, $\nu(\cdot \mid \mathcal{A})(\cdot)$ is also a regular conditional probability for $P$,
so we can write it as $P(\cdot \mid \mathcal{A})(\cdot)$. Given a randomized decision rule $d$, with values in the set
of probability measures on $(A, \mathcal{E})$, let $e$ be such a decision rule where for each $F \in \mathcal{E}$
and $x \in S$,

$$e(x)(F) := \int d(y)(F) \, P(dy \mid \mathcal{A})(x),$$

and for a measure $m$, $m(dy) := dm(y)$. In other words the function $y \mapsto d(y)(F)$, which
is assumed $\mathcal{B}$-measurable, is integrated with respect to the probability measure $P(\cdot \mid \mathcal{A})(x)$.
So the integral is well-defined. For fixed $F$, $e(x)(F)$ is an $\mathcal{A}$-measurable function of $x$,
in fact equal to $E(d(\cdot)(F) \mid \mathcal{A})$ (RAP, Theorem 10.2.5). For each $x$, $e(x)(\cdot)$ is a countably
additive probability measure on $\mathcal{E}$ since $d(y)(\cdot)$ is for each $y$ and $P(\cdot \mid \mathcal{A})(x)$ is a probability
measure. So $e(\cdot)$ is a well-defined $\mathcal{A}$-measurable randomized decision rule, and for each
$P \in \mathcal{P}$ and $x \in S$,

$$\int_{a \in A} L(P, a) \, e(x)(da) = \int_{y \in S} \int_{a \in A} L(P, a) \, d(y)(da) \, P(dy \mid \mathcal{A})(x),$$

so

$$r(P, e) = \int_{x \in S} \int_{y \in S} \int_{a \in A} L(P, a) \, d(y)(da) \, P(dy \mid \mathcal{A})(x) \, dP(x).$$

Now for any $f \in \mathcal{L}^1(P)$, by RAP, Theorem 10.2.5 again,

$$\int_{x \in S} \int_{y \in S} f(y) \, P(dy \mid \mathcal{A})(x) \, dP(x) = \int E_P(f \mid \mathcal{A})(x) \, dP(x) = \int f(x) \, dP(x).$$

Omitting the middle term, the equation holds for any measurable $f \ge 0$, taking integrable
$f_n := \min(f, n)$ with $0 \le f_n \uparrow f$. Applying this to $f(u) := \int_{a \in A} L(P, a) \, d(u)(da)$,
we get $r(P, e) = r(P, d)$. Since $P$ was any member of $\mathcal{P}$, the proof is complete. □
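The construction in the proof can be traced in a small discrete case. The sketch below assumes i.i.d. Bernoulli($p$) observations with the sum as sufficient statistic, squared-error loss, and a deliberately crude rule `d` that looks only at the first observation; all of these, and the function names, are illustrative assumptions. The randomized rule `e` draws $y$ from the conditional law given the sum (uniform over sequences with that sum) and then applies `d`.

```python
from fractions import Fraction
from itertools import product

def law(p, n):
    """Probability of each 0/1 sequence of length n under i.i.d. Bernoulli(p)."""
    return {x: p ** sum(x) * (1 - p) ** (n - sum(x))
            for x in product((0, 1), repeat=n)}

def d(x):
    """A crude nonrandomized rule: estimate p by the first observation alone."""
    return Fraction(x[0])

def loss(p, a):
    """Squared-error loss for estimating p by a."""
    return (a - p) ** 2

def risk_d(p, n):
    return sum(prob * loss(p, d(x)) for x, prob in law(p, n).items())

def risk_e(p, n):
    """Risk of the randomized rule e: given x, draw y uniformly among the
    sequences with the same sum as x, then take the decision d(y)."""
    P = law(p, n)
    total = Fraction(0)
    for x, prob in P.items():
        atom = [y for y in P if sum(y) == sum(x)]
        total += prob * sum(loss(p, d(y)) for y in atom) / len(atom)
    return total

for p in (Fraction(1, 4), Fraction(2, 3)):
    print(p, risk_d(p, 3), risk_e(p, 3))
```

The two risks agree exactly for each $p$, although `e` depends on the data only through the sum. The uniform conditional law used here is specific to this Bernoulli setup; in general one needs the regular conditional probabilities discussed in the proof.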

Examples. In each of the following, $S = \mathbb{R}^n$, $\mathcal{B}$ = the Borel $\sigma$-algebra, and $\mathcal{P}$ is a set
$\mathcal{P}(n, \mathcal{M})$ of distributions $\mu^n$, in other words laws for which the coordinates $x_1, \dots, x_n$ are
i.i.d. with distribution $\mu$, and where $\mu \in \mathcal{M}$, a set of laws on $\mathbb{R}$. In each case $T$ is a
sufficient statistic.

(i) (Binomial family). Let $\mathcal{M} = \{(1 - p)\delta_0 + p\delta_1 : 0 < p < 1\}$ and $T(x_1, \dots, x_n) =
x_1 + \cdots + x_n$.

(ii) (Poisson family). Let $\mathcal{M} = \{P_\lambda : 0 < \lambda < \infty\}$ where $P_\lambda(k) = e^{-\lambda} \lambda^k/k!$ for
$k = 0, 1, 2, \dots$, and $P_\lambda(B) = 0$ whenever $B$ contains no nonnegative integers. Again let
$T(x_1, \dots, x_n) = x_1 + \cdots + x_n$.

(iii) (Normal laws with variable location and fixed variance). Let $\mathcal{M} = \{N(\theta, 1) : \theta \in \mathbb{R}\}$,
with the same $T$.

(iv) (Fixed location and variable scale). Let $\mathcal{M} = \{N(0, \sigma^2) : 0 < \sigma^2 < \infty\}$, and
$T(x_1, \dots, x_n) = \sum_{j=1}^n x_j^2$.

(v) (Order statistics). Let $\mathcal{M}$ be the set of all Borel probability measures on $\mathbb{R}$ and let
$T(x_1, \dots, x_n) := (x_{(1)}, \dots, x_{(n)})$ where $x_{(1)}, \dots, x_{(n)}$ are $x_1, \dots, x_n$ arranged into
non-decreasing order, $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
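For the Poisson family in (ii), sufficiency of the sum can be seen numerically: the conditional probability of a sample given its sum comes out the same for every $\lambda$. The sketch below relies on the standard fact that a sum of $n$ i.i.d. Poisson($\lambda$) variables is Poisson($n\lambda$); the sample `(2, 0, 1)` and the values of $\lambda$ are arbitrary illustrative choices, and `math.prod` needs Python 3.8 or later.

```python
import math

def poisson_pmf(lam, k):
    """P_lam({k}) = exp(-lam) * lam^k / k! for a nonnegative integer k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def conditional_given_sum(lam, x):
    """P(X = x | X_1 + ... + X_n = t) for i.i.d. Poisson(lam), where t = sum(x)."""
    n, t = len(x), sum(x)
    joint = math.prod(poisson_pmf(lam, k) for k in x)
    return joint / poisson_pmf(n * lam, t)   # the sum is Poisson(n * lam)

x = (2, 0, 1)
for lam in (0.3, 1.0, 5.0):
    print(lam, conditional_given_sum(lam, x))
```

Each printed value agrees, up to rounding, with the multinomial probability $\frac{3!}{2!\,0!\,1!} \cdot 3^{-3} = 1/9$, in which $\lambda$ does not appear.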

PROBLEMS 

1. Show that the given statistics are, in fact, sufficient for the binomial and Poisson 
families. 

2. Likewise for the normal location and normal scale families. 

3. Show that the $n$-tuple of order statistics is pairwise sufficient for the family of all laws
$P^n$ on $\mathbb{R}^n$ where $P$ runs over the class of all laws on $\mathbb{R}$. (It is actually sufficient, but
that is harder to prove; see Problems 5 and 6.)

4. For $-\infty < a < b < \infty$ let $U[a, b]$ be the uniform distribution on the interval $[a, b]$, with
density $1/(b - a)$ on $[a, b]$ and $0$ elsewhere. Let $\mathcal{U}$ be the family of all such laws. Show
that for the set of all laws $P^n$ on $\mathbb{R}^n$ for $P \in \mathcal{U}$, $(x_{(1)}, x_{(n)})$, the minimum and maximum
of the observations, provide a sufficient statistic (with values in $\mathbb{R}^2$).

5. Let $(S, \mathcal{B})$ be a sample space. Let $\mathcal{P}_0$ be the family of all laws on $(S, \mathcal{B})$. Let $S^n$ be the
Cartesian product of $n$ copies of $S$ with product $\sigma$-algebra $\mathcal{B}^n$. Let $\mathcal{P} := \{P^n : P \in \mathcal{P}_0\}$
be the set of all laws on $S^n$ for which the coordinates $x_1, \dots, x_n$ are i.i.d. Let $\mathcal{S}_n$ be the
$\sigma$-algebra of all sets in $\mathcal{B}^n$ invariant under all permutations of coordinates. Show that
$\mathcal{S}_n$ is sufficient for $\mathcal{P}$. Hint: For any $\pi$ in the set $\Pi_n$ of all permutations of $\{1, \dots, n\}$,
and $x := (x_1, \dots, x_n)$, let $f_\pi(x) := (x_{\pi(1)}, \dots, x_{\pi(n)})$. Show that for any $B \in \mathcal{B}^n$ and
$P \in \mathcal{P}$, $P(B \mid \mathcal{S}_n)(x) = \frac{1}{n!} \sum_{\pi \in \Pi_n} 1_B(f_\pi(x))$.
6. If $S = \mathbb{R}$ in the previous problem, with Borel $\sigma$-algebra, show that the vector $T :=
(x_{(1)}, \dots, x_{(n)})$ of order statistics is a sufficient statistic. Hint: show that $\mathcal{S}_n$ is the
smallest $\sigma$-algebra for which $T$ is measurable.

NOTES 

Fisher (1922) invented the idea of sufficiency and Neyman (1935) gave a first form of
factorization theorem. Halmos and Savage (1949) proved the factorization theorem (2.1.4)
in case $h \in \mathcal{L}^1(S, \mathcal{B}, \mu)$. Bahadur (1954) removed the restriction on $h$.

Regarding measurability of the ranges of measurable functions, as in Theorem 2.1.3,
Darst (1971) showed that even an infinitely differentiable ($C^\infty$) function can take a Borel
set onto a non-Borel set.